R is a programming language designed to help you perform
statistical analysis, create graphics, and later on write your own
statistical software. R is becoming increasingly popular
and knowledge of R will help you on the job market. R is
probably the most versatile statistical tool out there (and it’s free
and open-source, so you can use it anywhere). It is, for example,
used in all fields of academia, from biology to economics, and outside
academia as well.
RStudio is a great graphical user interface for R. In
recent years, a growing number of features have been added to this
graphical user interface, which makes it the preferred choice for
learning R, especially among beginners. You can think about
it as R being the engine of the car and RStudio being the
dashboard.
RStudio projects make it straightforward to divide your work into multiple contexts, each with its own working directory, workspace, history, and source documents. A project is basically a folder on your computer that holds all the files relevant to a particular piece of work. Working in RStudio Projects has multiple advantages:
Whenever you open a project, a new R session
(process) is started. This makes sure that things you do in different
projects do not interfere with each other.
Git is a version control system that makes it easy to track changes
and work on code collaboratively. GitHub is a hosting service for
git. You can think of it as a public Dropbox for code but
on steroids. With version control, you will build your projects
step-by-step, be able to come back to any version of the project, and
accompany everything with human-readable messages.
As a student, you even get unlimited private repositories which you can use if you don’t feel like sharing your code with the rest of the world (yet). We will use private repositories to distribute code and assignments to you. And you will use it to keep track of your code and collaborate in teams.
With git, writing code for a project will look somewhat like this:
A Git repository is a space where you store and manage a project. It contains all of your project’s files and stores each file’s revision history. It’s common to refer to a repository as a repo.
We will use one repository for each lab and one repository for each homework assignment. You can directly import (“pull”) repositories via RStudio and save them on your computer. If you change something in your project, you can easily upload (“push”) the new version to GitHub. GitHub will keep track of all changes you make over time within your project.
Our workflow may appear a bit tricky at the beginning, but we are sure that you will handle it with ease very soon. We assume that by now you have downloaded and installed R and RStudio and have your personal GitHub account.
The course has its own page on GitHub, which you can find here: https://github.com/uni-mannheim-qm-2022. This is the place where you can find all relevant material for the lab sessions. It is also the place where you download and hand in your homework assignments.
So how does this work?
Go to https://github.com/uni-mannheim-qm-2022
and click on the repository for the current week (this week, this is
called week01_introduction). Now, click on the green
Clone or download button and select Use
HTTPS (this might already be selected by default, and if it is,
you’ll see the text Clone with HTTPS as in the image below). Click on
the clipboard icon to copy the repo URL.
Open RStudio, click on File on the top bar and select New Project....
Select Version Control.
Select Git.
Paste the repo URL into the Repository URL window. Click on Browse to select the folder on your computer where you want to store the project.
Click on Create Project.
Open the .Rmd file that is stored in the project (in week 1, this is called QM2022_Week01.Rmd).
The RStudio interface has four panes:
Enough preparation, let’s finally dive into R!
R can perform basic math operations. Here are some examples:
1 + 1
[1] 2
Some more calculations:
2 - 3
[1] -1
4 * 5
[1] 20
2^2
[1] 4
4 / 2
[1] 2
2^(1 / 2)
[1] 1.414214
R follows the usual order of operations, and parentheses let you control which parts are evaluated first.
((2 + 2) * 2)^2
[1] 64
This should give the same result as before.
(4 * 2)^2
[1] 64
But this of course gives a different result:
(2 + 2 * 2)^2
[1] 36
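As a small illustration of the order of operations (a sketch with made-up numbers, not part of the lab script): exponents are evaluated before multiplication, and multiplication before addition. Note that a leading minus sign is applied after the exponent.

```r
# Multiplication is evaluated before addition.
a <- 2 + 2 * 2
# Parentheses force the addition to happen first.
b <- (2 + 2) * 2
# The exponent is evaluated before the leading minus sign.
c1 <- -2^2
# Parentheses make the minus sign part of the base.
c2 <- (-2)^2
print(c(a, b, c1, c2))
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.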
You can also use other math functions you know from your calculator:
this is \(\sqrt{2}\)
sqrt(2)
[1] 1.414214
when you do not specify the base, R uses the natural log with base \(e\), i.e. \(\log_e(10)\)
log(10)
[1] 2.302585
but R can also use a different (virtually any) base, e.g. \(\log_{10}(10)\)
log(10, base = 10)
[1] 1
or with base = 5, i.e. \(\log_5(10)\)
log(10, 5)
[1] 1.430677
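The base argument follows the change-of-base rule, \(\log_b(x) = \log_e(x) / \log_e(b)\). The short sketch below checks this for the example above and shows two shortcut functions from base R, log10() and log2().

```r
# Change-of-base rule: log_b(x) = log(x) / log(b)
log_base5 <- log(10, base = 5)
by_hand <- log(10) / log(5)
all.equal(log_base5, by_hand)

# Shortcuts for two common bases
log10(1000)
log2(8)

# exp() is the inverse of the natural log
exp(log(10))
```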
Pro tip: Always close your parentheses!
It is hard to understand pure code, especially for someone who did not write it (and future-you will also have a hard time understanding it).
Pro tip: Add comments to your code, describing what you are doing and why you are doing it.
With comments:
Start a comment with the # symbol; everything after # on the same line will be commented out.
# this is a comment
1 + 1 # This line runs
[1] 2
# 1 + 1 This line does not run
Good coding style is like using correct punctuation.
Youcanmanagewithoutitbutitsuremakesthingseasiertoread.. – Hadley Wickham
But I already do have a calculator. Why do I need R?
R is so much more! R is an object-oriented programming language.
We use <- as the assignment operator.
Examples:
lucky_number <- 7 # 7 is only an example value; pick your own
# Now we created a (numeric) object called "lucky_number"
lucky_number
[1] 7
The class() command lets us check the type of an
object:
class(lucky_number)
[1] "numeric"
Let’s see how this works live, this time with a character object:
firstname <- "" # This is a character object
firstname
[1] ""
class(firstname)
[1] "character"
lastname <- ""
lastname
[1] ""
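To see how assignment and class() work together, here is a minimal sketch with three made-up objects (the names and values are illustrative only):

```r
number_object <- 3.5        # numeric
text_object <- "Mannheim"   # character
logical_object <- TRUE      # logical
class(number_object)
class(text_object)
class(logical_object)
# Objects can be reused in later calculations.
doubled <- number_object * 2
doubled
```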
What kind of data can I store in R? There are different types of objects that can contain different types and sets of data: vectors, matrices, data frames, arrays, and lists.
We will go through all of these object types below. On top of that we will also learn how to calculate the measures of central tendency and variability with vectors.
Let’s start with vectors. We want a vector of the numbers 1, 2, 3, 4 and 5. How do I assign this set of numbers to a vector?
The c() function
combines single values into a vector:
example_vec <- c(1, 2, 3, 4, 5)
example_vec
[1] 1 2 3 4 5
This also works for characters/strings:
country_code <- c("DE", "FR", "NL", "US", "UK")
country_code
[1] "DE" "FR" "NL" "US" "UK"
And it also works for a combination of numbers and characters:
example_vec2 <- c("Welcome", "to", "the", "lab", "in", "A", 5)
example_vec2
[1] "Welcome" "to" "the" "lab" "in" "A" "5"
What if we start with numbers?
example_vec3 <- c(1, 2, 3, 4, 5, "R can count!")
example_vec3
[1] "1" "2" "3" "4" "5" "R can count!"
Note that if you have a character element in your vector, R will turn ALL values into character data! (You can see that by the quotes around the values.)
Let’s check the type of data by using the class()
command on example_vec3.
example_vec3 <- c(1, 2, 3, 4, 5, "R can count!")
class(example_vec3)
[1] "character"
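If numbers end up stored as text, they can be converted back. The sketch below uses made-up vectors and the base R function as.numeric(); it also shows that logical values become 0/1 when combined with numbers.

```r
# Numbers stored as text cannot be used for math directly.
numbers_as_text <- c("1", "2", "3")
class(numbers_as_text)
# as.numeric() converts them back to numbers.
numbers <- as.numeric(numbers_as_text)
class(numbers)
sum(numbers)
# Logical values are turned into 0/1 when mixed with numbers.
mixed <- c(TRUE, FALSE, 5)
mixed
```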
You can use mathematical functions on each element in numeric vectors/matrices etc.
example_vec <- c(1, 2, 3, 4, 5)
sqrt(example_vec) # Take the square root of each element in example_vec
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
What about multiplication?
example_vec <- c(1, 2, 3, 4, 5)
example_vec * 10
[1] 10 20 30 40 50
There are also some functions that you can use on the whole vector.
example_vec <- c(1, 2, 3, 4, 5)
sum(example_vec) # Question: What does sum() do?
[1] 15
length(example_vec) # Question: What does length() do?
[1] 5
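Operations also work element by element between two vectors of the same length. A minimal sketch with hypothetical values:

```r
prices <- c(2, 4, 6)         # hypothetical values
quantities <- c(10, 20, 30)  # hypothetical values
# Element-wise: first with first, second with second, ...
revenue <- prices * quantities
revenue
# Functions on the whole vector can be combined.
average_revenue <- sum(revenue) / length(revenue)
average_revenue
```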
Matrices in R are two-dimensional table objects. In R, matrices are always indexed row by column (think of the initials R and C, as in roller coaster, Roman Catholic, or Ray Charles).
In a matrix, all data must be of the same type. If you mix numeric and character entries, the matrix will be all character just like in a vector.
How do I create a matrix in R?
example_mat1 <- matrix(c(1, 2, 3, 4, 5, 6),
nrow = 3,
ncol = 2
)
example_mat1 # How did R fill the numbers in the matrix?
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
You could also change the options and let R fill the matrix by rows (instead of columns):
example_mat2 <- matrix(c(1, 2, 3, 4, 5, 6),
nrow = 3,
ncol = 2,
byrow = TRUE
)
example_mat2 # See the difference?
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
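To check the shape of a matrix, base R offers dim(), nrow(), and ncol(). The sketch below rebuilds the two matrices from above under new, illustrative names and also shows t(), which swaps rows and columns.

```r
by_column <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)
dim(by_column)   # number of rows, then number of columns
nrow(by_column)
ncol(by_column)
# t() transposes a matrix: rows become columns.
transposed <- t(by_column)
dim(transposed)
# Filling by row and filling by column give different matrices.
identical(by_column, by_row)
```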
Or you could create a matrix from different vectors by using
column-bind on two or more vectors. It works similarly to the
c() function but with vectors as input instead of
scalars.
Let’s first create two vectors of the same length:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
# And now column-bind - cbind() - the two vectors.
example_mat3 <- cbind(vec1, vec2)
example_mat3
vec1 vec2
[1,] 1 7
[2,] 2 8
[3,] 3 9
[4,] 4 10
[5,] 5 11
[6,] 6 12
Similarly, we can row-bind – rbind() – the two
vectors:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
example_mat4 <- rbind(vec1, vec2)
example_mat4
[,1] [,2] [,3] [,4] [,5] [,6]
vec1 1 2 3 4 5 6
vec2 7 8 9 10 11 12
Data frames are two-dimensional table objects, just like matrices. Most data you will analyze in R will be in this form.
You can create data frames from vectors just like
cbind() using data.frame():
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
example_df1 <- data.frame(vec1, vec2)
example_df1
However, data frames are always column-bound vectors. In a data frame, everything within a column has to be of the same data type. But you can mix data types between columns:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
vec3 <-
c(
"First Row",
"Second Row",
"Third Row",
"Fourth Row",
"Fifth Row",
"Sixth Row"
)
example_df2 <- data.frame(vec1, vec2, vec3)
example_df2
You can also name your columns/variables. Either when creating your data frame:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
vec3 <-
c(
"First Row",
"Second Row",
"Third Row",
"Fourth Row",
"Fifth Row",
"Sixth Row"
)
example_df3 <- data.frame(
variable_1to6 = vec1,
variable_7to12 = vec2,
variable_rows = vec3
)
example_df3
Or by renaming the columns of an existing data frame.
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
vec3 <-
c(
"First Row",
"Second Row",
"Third Row",
"Fourth Row",
"Fifth Row",
"Sixth Row"
)
example_df3 <- data.frame(vec1, vec2, vec3)
# Rename the variables of an existing data frame
names(example_df3) <- c("variable.1", "variable.2", "variable.3")
example_df3
We can also add a new variable to an existing data frame. We simply create a data frame which consists of a data frame and a vector:
example_df4 <-
data.frame(example_df3,
variable_4 = c(90, 91, 92, 93, 94, 95))
example_df4
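Another common way to add a variable is the $ operator. The sketch below uses a small made-up data frame (toy_df is an illustrative name) and inspects the result with str(), nrow(), and ncol().

```r
toy_df <- data.frame(
  id = c(1, 2, 3),
  label = c("a", "b", "c")
)
# Add a new column with the $ operator.
toy_df$squared <- toy_df$id^2
# Inspect the result.
str(toy_df)   # column names and types
nrow(toy_df)
ncol(toy_df)
```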
Arrays are like matrices, except that they can have more than two dimensions (typically three). You’re not going to see many of these, but we’ll introduce them for completeness. Here is an illustration of what a three-dimensional array could look like:
You can think of ten 3 x 5 bingo cards that all display the numbers 1 through 15 as an array. To create that in R, I would use the array() function and write:
bingo_array <- array(seq(1, 15, 1),
dim = c(3, 5, 10))
bingo_array
The general syntax for this function is
array(values, dim = c(rows, columns, height)).
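Elements of an array are accessed with one index per dimension. A short sketch using the bingo array from above:

```r
bingo_array <- array(seq(1, 15, 1), dim = c(3, 5, 10))
dim(bingo_array)
# The 15 values are recycled, so every card looks the same.
identical(bingo_array[, , 1], bingo_array[, , 10])
# One card is an ordinary 3 x 5 matrix.
first_card <- bingo_array[, , 1]
dim(first_card)
# A single cell: row 2, column 3, card 7.
bingo_array[2, 3, 7]
```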
List objects can contain a series of the other objects we just learned about. A single list can contain a value, a vector, a matrix, AND a dataframe - or many of each!
How do I make a list?
Use the list()
function!
# create a vector
example_vec <- c(1, 2, 3, 4, 5, 6, 7, 8)
# create a matrix
example_mat <- matrix(c(1, 2, 3, 4, 5, 6),
nrow = 3,
ncol = 2)
# create an array
example_array <- array(seq(1, 15, 1), dim = c(3, 5, 10))
example_vec3 <- c(1, 2, 3, 4)
## Store all objects in a list
example_list <- list(example_vec, example_mat, example_array)
example_list
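Elements of a list are retrieved with double square brackets, or with $ if the elements have names. The following sketch uses a made-up named list (the names are illustrative):

```r
named_list <- list(
  numbers = c(1, 2, 3),
  words = c("a", "b"),
  small_mat = matrix(1:4, nrow = 2)
)
# Double brackets return the element itself.
named_list[[1]]
class(named_list[[1]])
# Single brackets return a (shorter) list.
class(named_list[1])
# Named elements can also be accessed with $.
named_list$words
# Indexing can be chained: row 2, column 1 of the matrix.
named_list$small_mat[2, 1]
```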
Sometimes we want to select single or multiple data entries from our
objects. We can do this by selecting elements via [].
Let’s first do it with a vector. Remember our country_code vector?
country_code <- c("DE", "FR", "NL", "US", "UK")
country_code
[1] "DE" "FR" "NL" "US" "UK"
Let’s say we only want to select the “US”. We can achieve this by accessing the value via its position in the vector:
country_code[4]
[1] "US"
Now we want to select all values but the “US”:
country_code[-4]
[1] "DE" "FR" "NL" "UK"
You can pass multiple indexes as a vector:
country_code[c(1, 2, 3)]
[1] "DE" "FR" "NL"
1:3 generates the vector c(1, 2, 3) a bit
quicker.
country_code[1:3]
[1] "DE" "FR" "NL"
If we want the values “DE”, “FR”, and “US”, a sequence does not really help. But we can put a vector with a combination of a sequence and some other values in the square brackets:
country_code[c(1:2, 4)]
[1] "DE" "FR" "US"
We can access values of a matrix similarly. However, we need to think of one additional dimension.
example_mat <- matrix(c(1, 2, 3, 4, 5, 6),
nrow = 3,
ncol = 2)
example_mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Generally, we type object[row, column] to access
specific rows and columns. To see how this works, let’s use the
example_mat shown above.
Now we want to access the value 6. It’s in the third row and the second column.
example_mat[3, 2]
[1] 6
We could also select an entire column (and use it like a vector).
example_mat[, 2]
[1] 4 5 6
By accessing values with the [] square brackets, we
can also replace values. Let’s say we want to recode the entire first
column in example_mat to 99:
example_mat[, 1] <- 99
example_mat
[,1] [,2]
[1,] 99 4
[2,] 99 5
[3,] 99 6
# And we want to recode the first and the third value in the second column
# to 91 and 100
example_mat[c(1, 3), 2] <- c(91, 100)
example_mat
[,1] [,2]
[1,] 99 91
[2,] 99 5
[3,] 99 100
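Rows can be selected in the same way as columns. The sketch below uses a fresh matrix m (an illustrative name) and also shows the drop = FALSE argument, which keeps a single column as a one-column matrix instead of simplifying it to a vector.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
# An entire row: leave the column position empty.
m[2, ]
# Several rows at once keep the matrix structure.
m[c(1, 3), ]
# A single column is simplified to a vector by default ...
is.matrix(m[, 2])
# ... unless we ask R to keep it as a one-column matrix.
one_column <- m[, 2, drop = FALSE]
dim(one_column)
```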
This is a good start to select and recode data in an object. However, it might be a bit exhausting (maybe even impossible) to always look up the exact position in the object.
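Luckily, R lets us also select elements based on conditions. Instead of the position, we put a condition in the [] square brackets. The following operators are available:
== equal to
!= not equal to
< less than
> greater than
<= less than or equal to
>= greater than or equal to
& and
| or
So how do conditions work? Let’s create a matrix to work with: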
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(7, 8, 9, 10, 11, 12)
# And now column-bind (cbind()) the two vectors.
example_mat <- cbind(vec1, vec2)
example_mat
vec1 vec2
[1,] 1 7
[2,] 2 8
[3,] 3 9
[4,] 4 10
[5,] 5 11
[6,] 6 12
example_mat > 9 # This returns TRUE or FALSE for each value in the object.
vec1 vec2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
Now if we put this condition in square brackets we get the values for which the condition is true.
example_mat[example_mat > 9]
[1] 10 11 12
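Conditions can be combined with & and |, and a condition in square brackets can also be used for recoding. A minimal sketch on a plain vector (values and names are illustrative):

```r
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
# & means "and": both conditions must be TRUE.
values[values > 2 & values < 10]
# | means "or": at least one condition must be TRUE.
values[values <= 2 | values >= 11]
# which() returns the positions where the condition holds.
which(values > 9)
# A condition in square brackets can also be used to recode.
recoded <- values
recoded[recoded > 9] <- 0
recoded
```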
Working with data frames is similar to working with matrices and vectors.
Usually (and especially for this class) we want to work with existing
datasets. R knows and can load most of the standard formats of datasets,
like .csv, .xlsx (Excel), .dta
(Stata), .sav (SPSS) and many more.
So far we only used R’s base functions. In order to use some more sophisticated or special R functions, we need to load libraries or packages first. Think of these libraries as extra apps that you can install on your smartphone to extend its functionality.
Right now, we want to load a dataset. To read datasets stored in the formats of other statistical software, we need the foreign package.
First, we want to have a look at what the package can do.
packageDescription("foreign")
Package: foreign
Priority: recommended
Version: 0.8-81
Date: 2020-12-22
Title: Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
Depends: R (>= 4.0.0)
Imports: methods, utils, stats
Authors@R: c( person("R Core Team", email = "R-core@R-project.org", role = c("aut", "cph", "cre")),
person("Roger", "Bivand", role = c("ctb", "cph")), person(c("Vincent", "J."), "Carey", role
= c("ctb", "cph")), person("Saikat", "DebRoy", role = c("ctb", "cph")), person("Stephen",
"Eglen", role = c("ctb", "cph")), person("Rajarshi", "Guha", role = c("ctb", "cph")),
person("Swetlana", "Herbrandt", role = "ctb"), person("Nicholas", "Lewin-Koh", role =
c("ctb", "cph")), person("Mark", "Myatt", role = c("ctb", "cph")), person("Michael",
"Nelson", role = "ctb"), person("Ben", "Pfaff", role = "ctb"), person("Brian", "Quistorff",
role = "ctb"), person("Frank", "Warmerdam", role = c("ctb", "cph")), person("Stephen",
"Weigand", role = c("ctb", "cph")), person("Free Software Foundation, Inc.", role = "cph"))
Contact: see 'MailingList'
Copyright: see file COPYRIGHTS
Description: Reading and writing data stored by some versions of 'Epi Info', 'Minitab', 'S', 'SAS',
'SPSS', 'Stata', 'Systat', 'Weka', and for reading and writing some 'dBase' files.
ByteCompile: yes
Biarch: yes
License: GPL (>= 2)
BugReports: https://bugs.r-project.org
MailingList: R-help@r-project.org
URL: https://svn.r-project.org/R-packages/trunk/foreign/
NeedsCompilation: yes
Packaged: 2020-12-22 13:59:32 UTC; hornik
Author: R Core Team [aut, cph, cre], Roger Bivand [ctb, cph], Vincent J. Carey [ctb, cph], Saikat DebRoy
[ctb, cph], Stephen Eglen [ctb, cph], Rajarshi Guha [ctb, cph], Swetlana Herbrandt [ctb],
Nicholas Lewin-Koh [ctb, cph], Mark Myatt [ctb, cph], Michael Nelson [ctb], Ben Pfaff [ctb],
Brian Quistorff [ctb], Frank Warmerdam [ctb, cph], Stephen Weigand [ctb, cph], Free Software
Foundation, Inc. [cph]
Maintainer: R Core Team <R-core@R-project.org>
Repository: CRAN
Date/Publication: 2020-12-22 14:59:20 UTC
Built: R 4.1.2; x86_64-apple-darwin17.0; 2021-11-01 20:59:13 UTC; unix
-- File: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/foreign/Meta/package.rds
# Ok this seems to be useful. So let's load the package to use it.
library(foreign)
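A package has to be installed once per computer and loaded once per R session. The foreign package usually ships with R, but the following sketch shows a general pattern that installs a package only if it is missing (one possible pattern, not the only one):

```r
# Install the package only if it is not available yet.
# (This needs an internet connection the first time.)
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")
}
# Load it in every new R session in which you want to use it.
library(foreign)
# Check that the package is now attached.
"package:foreign" %in% search()
```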
You will often come across datasets which are stored as Stata data
files. Those files have the extension .dta.
Right now, we want to load the data set called
weather_data_germany_2021.dta, which is already stored in the
raw_data folder in our directory:
weather_data <- read.dta("raw_data/weather_data_germany_2021.dta")
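Other file formats work in the same way with a different reading function. As a self-contained sketch, the code below writes a tiny made-up data frame to a temporary .csv file and reads it back with base R’s read.csv(); the file and object names are illustrative and not part of the lab material.

```r
# A tiny made-up data frame, written to a temporary file.
toy_data <- data.frame(
  city = c("A-Town", "B-Ville"),
  mean_temp = c(9.5, 8.1)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_data, csv_path, row.names = FALSE)
# Read it back in; the result is a data frame again.
toy_loaded <- read.csv(csv_path)
class(toy_loaded)
names(toy_loaded)
```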
The data contains yearly temperature averages of German cities as well as their geographical location (longitude and latitude). It comes from the “Deutscher Wetterdienst” and you can find it here. Now that we have loaded the data, we can have a look at it.
With head() we can look at the first six rows of the data
set:
head(weather_data)
But we can also look at the entire data set:
weather_data
If we only want to look at the variable names, we can use
names():
names(weather_data)
[1] "city" "longitude" "latitude" "mean_temp"
Now we can use our selecting abilities on a data frame. As before we can select elements via their numeric position:
weather_data[1, 2] # first row, second column
[1] 9.387966
weather_data[1:3, 1] # rows 1-3, first column
[1] "Wacken" "Hasenkrug-Hardebek" "Muskau, Bad"
Additionally, as columns usually have names in data frames, we can use the column names to select values in two ways.
First, we can put the column name in square brackets instead of a column number:
weather_data[1, "city"]
[1] "Wacken"
weather_data[, "mean_temp"]
We can also look at two variables at once:
weather_data[, c("city", "mean_temp")]
Second, we can also select an entire column by using the
$ operator with the column name:
data.frame_name$column_name. Just like this:
weather_data$mean_temp
[1] 9.48 9.35 9.29 8.13 10.54 3.61 10.31 8.98 9.04 11.17 9.63 9.83 7.52 9.27 8.49 8.98 9.46 9.54
[19] 8.41 10.01 8.95 9.88 8.89 9.43 8.94 9.81 9.92 8.75 7.13 8.87 9.77 9.53 9.59 10.22 9.67 9.41
[37] 10.16 10.29 6.65 7.47 9.44 10.13 8.23 8.51 9.69 10.45 8.37 6.50 9.45 9.73 9.66 9.52 10.67 7.33
[55] 9.33 5.29 10.02 5.38 8.26 10.70 9.17 8.75 7.00 9.12 9.79 7.21 8.53 8.82 9.03 7.41 9.77 9.54
[73] 8.29 9.85 8.51 9.88 8.66 8.61 8.39 7.92 10.21 9.66 9.80 9.95 10.15 10.23 8.61 9.43 10.24 9.95
[91] 10.40 9.42 8.28 7.52 9.23 8.26 8.42 9.76 10.11 9.13 9.71 9.53 9.59 10.00 9.16 5.85 9.95 10.75
[109] 6.63 10.58 9.27 9.70 9.56 10.49 5.75 9.31 9.07 9.76 8.56 9.71 9.92 7.74 9.28 9.69 8.34 9.74
[127] 7.11 8.18 8.84 7.94 9.64 10.11 10.78 9.43 7.91 8.91 10.98 9.33 7.47 9.31 10.35 8.95 8.93 7.54
[145] 9.68 8.36 9.06 9.57 8.85 9.48 9.62 9.22 9.90 9.42 7.92 10.31 6.77 7.28 9.63 9.43 8.65 9.43
[163] 8.69 10.52 8.49 10.15 9.69 8.77 9.74 9.66 8.57 8.81 9.18 8.16 10.31 7.02 9.27 9.28 8.89 8.93
[181] 9.13 7.56 9.30 9.04 8.80 9.92 8.24 7.94 9.81 8.68 8.66 10.37 9.05 9.78 9.69 8.56 8.46 8.79
[199] 9.05 7.60 8.40 8.41 9.54 9.61 8.75 8.85 9.24 8.12 9.36 9.43 7.77 9.38 9.34 9.73 9.17 6.18
[217] 9.44 8.78 7.20 8.39 9.78 9.77 10.19 9.92 9.47 8.95 10.14 8.90 9.74 7.79 8.69 9.30 8.30 9.48
[235] 7.76 8.62 10.52 9.65 9.23 8.77 10.06 9.34 8.72 8.60 10.85 10.66 10.59 9.05 9.61 8.06 7.25 8.02
[253] 3.78 10.64 9.04 8.98 9.68 9.81 8.45 9.53 9.92 8.45 9.18 9.77 9.70 9.81 10.12 7.78 9.90 8.50
[271] 9.48 9.14 10.40 9.84 10.16 9.07 10.85 8.63 -4.05 8.87 10.20 9.15 9.13 8.87 7.90 9.57 7.11 8.62
[289] 10.00 9.64 8.95 9.58 9.81 8.58 9.98 8.82 9.59 9.09 10.02 9.33 10.24 10.58 10.32 9.40 8.22 10.25
[307] 6.11 7.44 6.39 6.91 8.16 9.56 8.97 9.35 9.49 9.38 8.36 10.33 7.57 9.57 9.48 8.15 10.16 8.52
[325] 8.18 5.74 10.84 10.42 9.97 10.22 9.84 9.66 10.87 10.99 9.69 8.56 8.29 8.23 9.42 7.15 8.72 9.16
[343] 8.69 6.83 9.68 8.77 9.22 8.94 8.16 9.10 10.01 9.39 9.17 9.53 10.55 11.37 9.92 8.26 9.50 8.62
[361] 8.63 6.49 9.89 7.87 8.62 8.90 9.56 9.66 8.68 8.90 10.06 9.50 9.83 9.52 6.90 9.35 5.09 9.90
[379] 9.83 8.91 10.22 8.67 9.80 10.12 8.65 9.54 8.98 8.75 9.30 3.81 8.71 9.13 8.26 10.88 8.47 9.13
[397] 9.61 8.49 9.36 9.71 8.45 10.23 6.90 9.29 7.41 9.57 8.02 10.48 9.18 8.68 9.88 9.50 9.87 9.93
[415] 7.37 7.96 9.16 9.66 9.25 8.57 9.98 9.34 7.59 8.37 9.23 9.84 7.94 9.44 8.87 9.42 10.13 8.70
[433] 9.44 9.40 8.26 8.81 8.91 9.76 10.07 8.02 10.76 8.91 10.61 9.32 9.19 9.21 9.31 8.62 9.04 9.35
[451] 8.75 9.96 9.75 10.06 9.71 7.16 9.79 8.78 9.90 10.52 8.14 9.68 10.03 8.50 10.33 7.80 8.30 10.05
[469] 8.53 9.47 8.70 7.76
Columns from data frames are essentially vectors. We can use all the operations and functions we can use for vectors (depending on their class).
weather_data$mean_temp[1] # For example, we can select an element of the vector
[1] 9.48
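Selecting by condition works for data frames as well. The sketch below uses a small made-up data frame (toy_weather and its values are purely illustrative) so that it runs without the lab data:

```r
# Made-up values, only for illustration.
toy_weather <- data.frame(
  city = c("A-Town", "B-Ville", "C-Burg", "D-Dorf"),
  mean_temp = c(9.5, 6.2, 10.1, 7.8)
)
# Rows where the condition is TRUE, all columns.
toy_weather[toy_weather$mean_temp < 8, ]
# Only the city names of those rows.
cool_cities <- toy_weather$city[toy_weather$mean_temp < 8]
cool_cities
# Number of rows that meet the condition.
sum(toy_weather$mean_temp < 8)
```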
What if we want to add a new variable? Let’s create a variable named “cold”.
weather_data$cold <- 0
# What does this do?
weather_data$cold
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[57] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[113] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[169] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[225] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[337] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[393] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[449] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Now, we want to recode “cold” to 1 for cities whose mean temperature is below a chosen threshold.
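A minimal sketch of such a conditional recode, using a small made-up data frame. The cut-off of 8 degrees is a hypothetical choice for illustration; the threshold used in the lab is not stated here.

```r
toy_weather <- data.frame(
  city = c("A-Town", "B-Ville", "C-Burg", "D-Dorf"),
  mean_temp = c(9.5, 6.2, 10.1, 7.8)
)
cold_threshold <- 8   # hypothetical cut-off, not taken from the lab
# Start with 0 for every city ...
toy_weather$cold <- 0
# ... and set it to 1 where the condition holds.
toy_weather$cold[toy_weather$mean_temp < cold_threshold] <- 1
toy_weather
table(toy_weather$cold)
```

The same two steps should work on weather_data once you have decided on a threshold.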
Let’s look at the Measures of Central Tendency and Variability from the lecture (starting at slide 16).
Consider the following vector:
example_vec <- c(1, 2, 3, 4, 5)
How could we calculate the mean of example_vec?
We could simply calculate it “by hand”:
(1 + 2 + 3 + 4 + 5) / 5
[1] 3
But this is not very useful if we look at an actual vector in our data frame, e.g., mean temperature:
weather_data$mean_temp
[1] 9.48 9.35 9.29 8.13 10.54 3.61 10.31 8.98 9.04 11.17 9.63 9.83 7.52 9.27 8.49 8.98 9.46 9.54
[19] 8.41 10.01 8.95 9.88 8.89 9.43 8.94 9.81 9.92 8.75 7.13 8.87 9.77 9.53 9.59 10.22 9.67 9.41
[37] 10.16 10.29 6.65 7.47 9.44 10.13 8.23 8.51 9.69 10.45 8.37 6.50 9.45 9.73 9.66 9.52 10.67 7.33
[55] 9.33 5.29 10.02 5.38 8.26 10.70 9.17 8.75 7.00 9.12 9.79 7.21 8.53 8.82 9.03 7.41 9.77 9.54
[73] 8.29 9.85 8.51 9.88 8.66 8.61 8.39 7.92 10.21 9.66 9.80 9.95 10.15 10.23 8.61 9.43 10.24 9.95
[91] 10.40 9.42 8.28 7.52 9.23 8.26 8.42 9.76 10.11 9.13 9.71 9.53 9.59 10.00 9.16 5.85 9.95 10.75
[109] 6.63 10.58 9.27 9.70 9.56 10.49 5.75 9.31 9.07 9.76 8.56 9.71 9.92 7.74 9.28 9.69 8.34 9.74
[127] 7.11 8.18 8.84 7.94 9.64 10.11 10.78 9.43 7.91 8.91 10.98 9.33 7.47 9.31 10.35 8.95 8.93 7.54
[145] 9.68 8.36 9.06 9.57 8.85 9.48 9.62 9.22 9.90 9.42 7.92 10.31 6.77 7.28 9.63 9.43 8.65 9.43
[163] 8.69 10.52 8.49 10.15 9.69 8.77 9.74 9.66 8.57 8.81 9.18 8.16 10.31 7.02 9.27 9.28 8.89 8.93
[181] 9.13 7.56 9.30 9.04 8.80 9.92 8.24 7.94 9.81 8.68 8.66 10.37 9.05 9.78 9.69 8.56 8.46 8.79
[199] 9.05 7.60 8.40 8.41 9.54 9.61 8.75 8.85 9.24 8.12 9.36 9.43 7.77 9.38 9.34 9.73 9.17 6.18
[217] 9.44 8.78 7.20 8.39 9.78 9.77 10.19 9.92 9.47 8.95 10.14 8.90 9.74 7.79 8.69 9.30 8.30 9.48
[235] 7.76 8.62 10.52 9.65 9.23 8.77 10.06 9.34 8.72 8.60 10.85 10.66 10.59 9.05 9.61 8.06 7.25 8.02
[253] 3.78 10.64 9.04 8.98 9.68 9.81 8.45 9.53 9.92 8.45 9.18 9.77 9.70 9.81 10.12 7.78 9.90 8.50
[271] 9.48 9.14 10.40 9.84 10.16 9.07 10.85 8.63 -4.05 8.87 10.20 9.15 9.13 8.87 7.90 9.57 7.11 8.62
[289] 10.00 9.64 8.95 9.58 9.81 8.58 9.98 8.82 9.59 9.09 10.02 9.33 10.24 10.58 10.32 9.40 8.22 10.25
[307] 6.11 7.44 6.39 6.91 8.16 9.56 8.97 9.35 9.49 9.38 8.36 10.33 7.57 9.57 9.48 8.15 10.16 8.52
[325] 8.18 5.74 10.84 10.42 9.97 10.22 9.84 9.66 10.87 10.99 9.69 8.56 8.29 8.23 9.42 7.15 8.72 9.16
[343] 8.69 6.83 9.68 8.77 9.22 8.94 8.16 9.10 10.01 9.39 9.17 9.53 10.55 11.37 9.92 8.26 9.50 8.62
[361] 8.63 6.49 9.89 7.87 8.62 8.90 9.56 9.66 8.68 8.90 10.06 9.50 9.83 9.52 6.90 9.35 5.09 9.90
[379] 9.83 8.91 10.22 8.67 9.80 10.12 8.65 9.54 8.98 8.75 9.30 3.81 8.71 9.13 8.26 10.88 8.47 9.13
[397] 9.61 8.49 9.36 9.71 8.45 10.23 6.90 9.29 7.41 9.57 8.02 10.48 9.18 8.68 9.88 9.50 9.87 9.93
[415] 7.37 7.96 9.16 9.66 9.25 8.57 9.98 9.34 7.59 8.37 9.23 9.84 7.94 9.44 8.87 9.42 10.13 8.70
[433] 9.44 9.40 8.26 8.81 8.91 9.76 10.07 8.02 10.76 8.91 10.61 9.32 9.19 9.21 9.31 8.62 9.04 9.35
[451] 8.75 9.96 9.75 10.06 9.71 7.16 9.79 8.78 9.90 10.52 8.14 9.68 10.03 8.50 10.33 7.80 8.30 10.05
[469] 8.53 9.47 8.70 7.76
Typing up all the entries individually would take a lot of time. We could use two functions that we already have seen, sum and length.
sum(weather_data$mean_temp) / length(weather_data$mean_temp)
[1] 9.037903
Fortunately, R provides a much easier way to calculate a mean:
mean(weather_data$mean_temp) # That was easy.
[1] 9.037903
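One thing to keep in mind: if a vector contains missing values (NA), mean() returns NA unless you set na.rm = TRUE. A short sketch with made-up values:

```r
temps_with_gap <- c(9.5, NA, 8.1)  # made-up values with one missing entry
# With a missing value, the mean is missing as well.
mean(temps_with_gap)
# na.rm = TRUE drops missing values before calculating.
mean(temps_with_gap, na.rm = TRUE)
```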
But be sure that your vector is numeric. Could you calculate the mean of city?
weather_data$city
Let’s try to calculate the mean.
mean(weather_data$city)
Warning: argument is not numeric or logical: returning NA
[1] NA
It does not work! Even by hand, we could not calculate the mean of a character-valued vector.
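Here is an overview of functions for measures of central tendency and variability:
mean(): mean
median(): median
var(): variance
sd(): standard deviation
range(): range (minimum and maximum)
IQR(): interquartile range
You can try them out here: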
# Median
median(weather_data$mean_temp)
[1] 9.28
# Variance
var(weather_data$mean_temp)
[1] 1.566767
# Standard deviation
sd(weather_data$mean_temp)
[1] 1.251706
# Range
range(weather_data$mean_temp)
[1] -4.05 11.37
# Inter Quartile Range (IQR)
IQR(weather_data$mean_temp)
[1] 1.21
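To connect these functions to the formulas, the following sketch recomputes the sample variance, the standard deviation, and the IQR by hand on a made-up vector and compares them with R’s built-in results:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values
n <- length(x)
# Sample variance: squared deviations from the mean, divided by n - 1.
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
# The standard deviation is the square root of the variance.
sd_by_hand <- sqrt(var_by_hand)
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
# The IQR is the distance between the 75% and the 25% quantile.
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))
all.equal(iqr_by_hand, IQR(x))
```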
Unfortunately, there is no direct function to get the mode. The solutions you will find online are all a bit advanced. So the easiest solution is to look for the mode using a frequency table.
table(weather_data$cold)
0 1
409 63
The table() function shows you how often each value is
in the vector. You can now identify the most frequent value.
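If you would rather let R pick the most frequent value from the frequency table, one possible shortcut is to combine table() with which.max(). This is a sketch on a made-up vector; note that with ties it only returns the first of the most frequent values.

```r
x <- c(0, 1, 0, 0, 1, 2)  # made-up values
freq <- table(x)
freq
# which.max() finds the first position with the highest count;
# names() returns the corresponding value (as text).
mode_value <- names(freq)[which.max(freq)]
mode_value
```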
Let’s have a short look at our data again. Remember:
head() shows you the first six entries of your data. It is
very useful to get a look at the data structure when you have a lot of
rows in your dataset.
Now we can create a simple scatterplot:
plot(
x = weather_data$longitude,
y = weather_data$mean_temp
)
To get a nicer plot, we can adjust many things. We suggest always making those adjustments explicitly and in the same order.
plot(
x = weather_data$longitude,
y = weather_data$mean_temp,
type = "p", # This explicitly says that we want points. You could also try "l".
main = "Mean temperatures of German cities", # This adds a title to the plot
xlab = "Longitude (West - East)", # This labels the x-axis.
ylab = "Mean Temperature in 2021", # What does this do then?
las = 1, # This affects the tick labels of the y-axis.
pch = 19, # Here we choose what symbols we want to plot.
col = "black", # What color should the symbols have?
frame = F # No box around the plot.
)
We can also adjust the colors. Let’s highlight Mannheim!
Pro Tip: To color up your data visualizations, use the viridis-package.
Viridis colors make it easier to read by those with colorblindness and print well in greyscale. You probably don’t want to have plots like this:
We first need a vector that gives us the right colors with respect to the city variable.
library(viridis)
Loading required package: viridisLite
# we need two colors, this is how we get them:
two_colors <- viridis(2)
two_colors # these are so-called HEX color codes
[1] "#440154FF" "#FDE725FF"
# we use the first color for Mannheim and the second for all other cities
mannheim_color <- ifelse(weather_data$city == "Mannheim", two_colors[1], two_colors[2])
# let's have a look:
table(mannheim_color)
mannheim_color
#440154FF #FDE725FF
1 471
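The ifelse() function used above checks a condition for every element of a vector and returns one value where the condition is TRUE and another where it is FALSE. A minimal sketch on a made-up vector:

```r
cities <- c("Mannheim", "Berlin", "Mannheim", "Hamburg")  # made-up vector
# The test is evaluated for every element ...
is_mannheim <- cities == "Mannheim"
is_mannheim
# ... and ifelse(test, yes, no) returns `yes` where it is TRUE
# and `no` where it is FALSE.
labels <- ifelse(is_mannheim, "Mannheim", "other")
labels
table(labels)
```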
Now we can use this vector to color each point depending on whether the city is Mannheim:
plot(
x = weather_data$longitude,
y = weather_data$mean_temp,
type = "p", # This explicitly says that we want points. You could also try "l".
main = "Mean temperatures of German cities", # This adds a title to the plot
xlab = "Longitude (West - East)", # This labels the x-axis.
ylab = "Mean Temperature in 2021", # What does this do then?
las = 1, # This affects the tick labels of the y-axis.
pch = 19, # Here we choose what symbols we want to plot.
col = mannheim_color, # Instead of just black we now use the color vector.
frame = F # No frame around the plot.
)
Now that we use different colors, we also need a legend to know which color is which.
plot(
x = weather_data$longitude,
y = weather_data$mean_temp,
type = "p", # This explicitly says that we want points. You could also try "l".
main = "Mean temperatures of German cities", # This adds a title to the plot
xlab = "Longitude (West - East)", # This labels the x-axis.
ylab = "Mean Temperature in 2021", # What does this do then?
las = 1, # This affects the tick labels of the y-axis.
pch = 19, # Here we choose what symbols we want to plot.
col = mannheim_color, # Instead of just black we now use the color vector.
frame = F # No frame around the plot.
)
legend(
"bottomleft", # Locate the legend in the bottom-left corner.
legend = c("Mannheim", "other"), # Give it labels.
pch = 19, # Specify symbols as in the scatterplot.
col = two_colors, # Specify colors.
bty = "n" # No box around the legend.
)
Instead of a legend, we can also label the Mannheim point directly in the plot:
plot(
x = weather_data$longitude,
y = weather_data$mean_temp,
type = "p", # This explicitly says that we want points. You could also try "l".
main = "Mean temperatures of German cities", # This adds a title to the plot
xlab = "Longitude (West - East)", # This labels the x-axis.
ylab = "Mean Temperature in 2021", # What does this do then?
las = 1, # This affects the tick labels of the y-axis.
pch = 19, # Here we choose what symbols we want to plot.
col = mannheim_color, # Instead of just black we now use the color vector.
frame = F # No frame around the plot.
)
text(
x = weather_data$longitude[weather_data$city == "Mannheim"],
y = weather_data$mean_temp[weather_data$city == "Mannheim"],
labels = "Mannheim",
pos = 4
)
Now we want to visualize mean temperature with a histogram. This is how you get a standard histogram:
hist(x = weather_data$mean_temp) # That's intuitive, but does not look too great
Again, we can adjust many things to make it nicer.
hist(
x = weather_data$mean_temp, # For a histogram we only specify x.
breaks = 50, # Suggest the number of bins (R may adjust it).
main = "A Histogram",
xlab = "Mean temperature",
ylab = "Number of observations",
las = 1, # Print the axis tick labels horizontally.
col = viridis(1), # One color only (first color from viridis)
border = "white" # That's the color of the bin borders.
)
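A note on breaks: if you pass a single number, hist() treats it as a suggestion and picks "pretty" break points, so you may not get exactly the number of bins you asked for. Here is a minimal sketch with simulated data. The vector toy_temp is made up for illustration and is not part of weather_data.

```r
# Simulated temperatures, for illustration only (not the course data).
set.seed(123)
toy_temp <- rnorm(100, mean = 10, sd = 2)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(toy_temp, breaks = 50, plot = FALSE)

length(h$counts) # The number of bins R actually used.
h$breaks         # The break points R chose.

# To force an exact number of bins, pass a vector of break points.
# 51 break points that cover the whole data range give 50 bins.
exact_breaks <- seq(floor(min(toy_temp)), ceiling(max(toy_temp)), length.out = 51)
h_exact <- hist(toy_temp, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)
```

If the exact number of bins matters to you, the vector version is the safer choice.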
We can also create density plots.
plot(
density(weather_data$mean_temp), # density() takes care of x, y and type.
main = "A Simple Density Plot",
xlab = "Mean temperature",
ylab = "", # The y-axis is not really meaningful here.
col = viridis(1),
lwd = 2, # Control the width of the line
frame = F,
yaxt = "n" # Remove the y-axis.
)
And we can also fill the area underneath the curve:
plot(
density(weather_data$mean_temp), # density() takes care of x, y and type.
main = "A Simple Density Plot",
xlab = "Mean temperature",
ylab = "", # The y-axis is not really meaningful here.
col = viridis(1),
lwd = 2, # Control the width of the line
frame = F,
yaxt = "n" # Remove the y-axis.
)
polygon(density(weather_data$mean_temp),
col = viridis(1, alpha = 0.5) # same color but 50% transparent
)
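Why can we hand density() directly to plot() and polygon()? The function returns an object that stores the x and y coordinates of the estimated curve, and both plotting functions pick these up. A small sketch with simulated data (toy_temp is again a made-up vector, not the course data):

```r
# Simulated temperatures, for illustration only.
set.seed(123)
toy_temp <- rnorm(100, mean = 10, sd = 2)

d <- density(toy_temp)

length(d$x) # By default the curve is evaluated at 512 points.
length(d$y) # One density value per point.
head(d$x)   # The first few x coordinates.
d$bw        # The bandwidth that was chosen automatically.
```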
boxplot(
x = weather_data$mean_temp, # As for histograms we only specify x.
main = "Boxplot of Mean temperature in degree Celsius",
ylab = "Mean temperature in degree Celsius",
las = 1,
col = plasma(1),
frame = F
)
Or a horizontal boxplot.
boxplot(
x = weather_data$mean_temp,
horizontal = T, # With horizontal = T we rotate the boxplot.
main = "Horizontal Boxplot of Mean temperature in degree Celsius",
xlab = "Mean temperature in degree Celsius",
las = 1,
frame = F
)
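If you want to see the numbers behind a boxplot, boxplot.stats() returns them without drawing anything. The sketch below uses a small made-up vector (toy_temp), not the weather data.

```r
# A small made-up vector with one unusually high value.
toy_temp <- c(7.5, 8.1, 9.0, 9.4, 10.2, 10.8, 11.1, 16.0)

b <- boxplot.stats(toy_temp)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
b$stats

# Values beyond the whiskers, which boxplot() draws as single points.
b$out
```

Here 16.0 lies more than 1.5 times the box length above the upper hinge, so it is reported in b$out.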
You learned in the lecture that boxplots have some disadvantages.
Violin plots are a very nice alternative!
This is how you get them:
library(vioplot)
vioplot(
x = weather_data$mean_temp,
horizontal = T, # With horizontal = T we rotate the violin plot.
main = "Horizontal Violin Plot of Mean temperature in degree Celsius",
xaxt = "n",
xlab = "Mean temperature in degree Celsius",
bty = "n",
axes = FALSE,
names = "",
border = NA
)
This has been a lot but now it’s finally your turn! We have a series of exercises for you to try out the stuff you just learned.
Pro tip: Copy the lines of code that worked for something similar. Then, adjust the code according to your problem. That’s how coding works most of the time!
Create three objects:
1. `my_lucky_number` should contain your lucky number.
2. `my_firstname` should contain your first name.
3. `my_lastname` should contain your last name.
After you created the objects, call them separately. Don’t forget to add comments to your code.
Create two vectors, `vec1` and `vec2`.
1. `vec1` should contain 1, 56, 23, 89, -3 and 5 (in that order). `vec2` should contain 24, 78, 32, 27, 8 and 1.
2. Select the elements of `vec1` that are greater than 5 or smaller than 0.
3. Set `vec1` to zero wherever `vec2` is greater than 30 and smaller than or equal to 32.
Please solve all three steps in the next code chunk.
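If you are unsure how to combine conditions, here is a sketch of the general pattern on two made-up vectors (u and v are hypothetical and deliberately different from the exercise): | means "or" and & means "and".

```r
# Two made-up vectors of the same length.
u <- c(4, -2, 10, 7)
v <- c(1, 5, 3, 9)

# "or": keep the elements of u that are greater than 5 or smaller than 0.
picked <- u[u > 5 | u < 0]
picked

# "and": overwrite u wherever v is greater than 2 and at most 5.
u[v > 2 & v <= 5] <- 0
u
```

Note that the condition on v selects positions, and those positions are then changed in u. This only works because both vectors have the same length.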
Now we will work with the `weather_data` data set. It is already loaded for you and you can use it right away.
1. Show the variable `age` if it is over 60.
2. Generate a new variable called `hot` that is 0 for a mean temperature < 10 and 1 for a mean temperature > 10 degrees Celsius.
3. Have a look at your data set.
Please solve all three steps in the next code chunk.
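One possible way to build such a 0/1 variable is ifelse(). The sketch below uses a tiny made-up data frame (toy, with a hypothetical new column warm), not weather_data, so you still need to adapt it.

```r
# A tiny made-up data frame, for illustration only.
toy <- data.frame(
  city = c("A-Town", "B-Town", "C-Town"),
  mean_temp = c(8.2, 11.5, 10.0),
  stringsAsFactors = FALSE
)

# Show a variable only where a condition holds.
toy$mean_temp[toy$mean_temp > 10]

# ifelse(condition, value if TRUE, value if FALSE)
toy$warm <- ifelse(toy$mean_temp > 10, 1, 0)

# Have a look at the result.
toy
```

In this sketch a value of exactly 10 ends up in the 0 group, because the condition uses > and not >=.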
Can you find the hottest and coldest city in Germany in 2021?
Hint: The functions min() and max() help
you to find the minimum and maximum values of a vector or variable.
Combine that with your newly learned subsetting skills and you’ll find
the answer.
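The idea behind the hint, sketched on a made-up data frame (toy_cities and its values are hypothetical): compare each value with the maximum or minimum and use the result to pick the matching row.

```r
# A made-up data frame, for illustration only.
toy_cities <- data.frame(
  city = c("A-Town", "B-Town", "C-Town"),
  mean_temp = c(8.2, 11.5, 10.0),
  stringsAsFactors = FALSE
)

# The highest and lowest value of the variable.
max(toy_cities$mean_temp)
min(toy_cities$mean_temp)

# Which city belongs to the highest and the lowest value?
warmest <- toy_cities$city[toy_cities$mean_temp == max(toy_cities$mean_temp)]
coldest <- toy_cities$city[toy_cities$mean_temp == min(toy_cities$mean_temp)]
warmest
coldest
```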
We will continue working with the weather data set, in particular the variable latitude.
What we learned in this session:
The first lab session and this script should equip you with all the tools (and lines of code) to tackle the first homework assignment.
Copy the lines of code that worked for something similar. Then, adjust the code according to your problem.
In terms of content, your homework asks you to inspect a data set on US presidential elections. You will calculate some measures of central tendency and variability. Finally, you will produce some nice plots.
It is best to get started with your homework as soon as possible (once it has been handed out on Tuesday).
Try to write the R code first. We will provide you with a
.Rmd template to hand in your results.
In order to pass the homework assignment you need to tackle ALL problems of a problem set. For a pass you also need to get most of the problems right (or at least show us that you tried everything to get them right).
If you have any questions concerning the lecture or the tutorial, please post them to the ILIAS forum or on Slack. We will answer them on a regular basis.
Do not hesitate to come to the office hours!
And always remember: if you have a question, it is never a stupid question. In fact, most of your fellow students probably have the same or a similar question. By asking it, you help everyone in this class.